{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Agenda\n",
      "\n",
      "- Define the problem and the approach\n",
      "- Data basics: loading data, looking at your data, basic commands\n",
      "- Handling missing values\n",
      "- Intro to scikit-learn\n",
      "- <p style=\"color: red\">Grouping and aggregating data</p>\n",
      "- Feature selection\n",
      "- Fitting and evaluating a model\n",
      "- Deploying your work"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#In this workbook you will\n",
      "- Receive and overview of the `apply` function\n",
      "- Write custom functions for analyzing data in `pandas`\n",
      "- Do SQL style joins on your data frames"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import pylab as pl"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.read_csv(\"./data/credit-data-trainingset.csv\")\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>serious_dlqin2yrs</th>\n",
        "      <th>revolving_utilization_of_unsecured_lines</th>\n",
        "      <th>age</th>\n",
        "      <th>number_of_time30-59_days_past_due_not_worse</th>\n",
        "      <th>debt_ratio</th>\n",
        "      <th>monthly_income</th>\n",
        "      <th>number_of_open_credit_lines_and_loans</th>\n",
        "      <th>number_of_times90_days_late</th>\n",
        "      <th>number_real_estate_loans_or_lines</th>\n",
        "      <th>number_of_time60-89_days_past_due_not_worse</th>\n",
        "      <th>number_of_dependents</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 1</td>\n",
        "      <td> 0.766127</td>\n",
        "      <td> 45</td>\n",
        "      <td> 2</td>\n",
        "      <td> 0.802982</td>\n",
        "      <td>  9120</td>\n",
        "      <td> 13</td>\n",
        "      <td> 0</td>\n",
        "      <td> 6</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.957151</td>\n",
        "      <td> 40</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0.121876</td>\n",
        "      <td>  2600</td>\n",
        "      <td>  4</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.658180</td>\n",
        "      <td> 38</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0.085113</td>\n",
        "      <td>  3042</td>\n",
        "      <td>  2</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.907239</td>\n",
        "      <td> 49</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0.024926</td>\n",
        "      <td> 63588</td>\n",
        "      <td>  7</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.213179</td>\n",
        "      <td> 74</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0.375607</td>\n",
        "      <td>  3500</td>\n",
        "      <td>  3</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "   serious_dlqin2yrs  revolving_utilization_of_unsecured_lines  age  \\\n",
        "0                  1                                  0.766127   45   \n",
        "1                  0                                  0.957151   40   \n",
        "2                  0                                  0.658180   38   \n",
        "3                  0                                  0.907239   49   \n",
        "4                  0                                  0.213179   74   \n",
        "\n",
        "   number_of_time30-59_days_past_due_not_worse  debt_ratio  monthly_income  \\\n",
        "0                                            2    0.802982            9120   \n",
        "1                                            0    0.121876            2600   \n",
        "2                                            1    0.085113            3042   \n",
        "3                                            1    0.024926           63588   \n",
        "4                                            0    0.375607            3500   \n",
        "\n",
        "   number_of_open_credit_lines_and_loans  number_of_times90_days_late  \\\n",
        "0                                     13                            0   \n",
        "1                                      4                            0   \n",
        "2                                      2                            1   \n",
        "3                                      7                            0   \n",
        "4                                      3                            0   \n",
        "\n",
        "   number_real_estate_loans_or_lines  \\\n",
        "0                                  6   \n",
        "1                                  0   \n",
        "2                                  0   \n",
        "3                                  1   \n",
        "4                                  1   \n",
        "\n",
        "   number_of_time60-89_days_past_due_not_worse  number_of_dependents  \n",
        "0                                            0                     2  \n",
        "1                                            0                     1  \n",
        "2                                            0                     0  \n",
        "3                                            0                     0  \n",
        "4                                            0                     1  "
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Apply\n",
      "\"Applies\" or operates on a column in your data frame with a given function. This is analagous to an Excel formula."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.monthly_income.apply(np.log)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "0      9.118225\n",
        "1      7.863267\n",
        "2      8.020270\n",
        "3     11.060180\n",
        "4      8.160518\n",
        "5      8.160518\n",
        "6     10.072555\n",
        "7      7.824046\n",
        "8      8.779711\n",
        "9      9.429797\n",
        "10         -inf\n",
        "11     9.338030\n",
        "12     9.082507\n",
        "13     8.095599\n",
        "14     5.808142\n",
        "...\n",
        "112400    9.752665\n",
        "112401    8.329658\n",
        "112402    7.170120\n",
        "112403    9.176370\n",
        "112404    8.832442\n",
        "112405    7.955074\n",
        "112406    8.081784\n",
        "112407    7.899895\n",
        "112408    8.081784\n",
        "112409    8.702344\n",
        "112410    9.528794\n",
        "112411    8.702344\n",
        "112412    7.675546\n",
        "112413    8.235095\n",
        "112414    9.210340\n",
        "Name: monthly_income, Length: 112415, dtype: float64"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Applying with lambda functions\n",
      "A `lambda` function is an anonymous function. Think of it just as a shorthand way to define a quick function that you need once."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "add_10 = lambda x: x + 10\n",
      "plus = lambda x, y: x + y\n",
      "\n",
      "print add_10(9)\n",
      "print plus(10, 20)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "19\n",
        "30\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.monthly_income.apply(lambda x: np.log(x + 1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "0      9.118335\n",
        "1      7.863651\n",
        "2      8.020599\n",
        "3     11.060196\n",
        "4      8.160804\n",
        "5      8.160804\n",
        "6     10.072597\n",
        "7      7.824446\n",
        "8      8.779865\n",
        "9      9.429877\n",
        "10     0.000000\n",
        "11     9.338118\n",
        "12     9.082621\n",
        "13     8.095904\n",
        "14     5.811141\n",
        "...\n",
        "112400    9.752723\n",
        "112401    8.329899\n",
        "112402    7.170888\n",
        "112403    9.176473\n",
        "112404    8.832588\n",
        "112405    7.955425\n",
        "112406    8.082093\n",
        "112407    7.900266\n",
        "112408    8.082093\n",
        "112409    8.702510\n",
        "112410    9.528867\n",
        "112411    8.702510\n",
        "112412    7.676010\n",
        "112413    8.235361\n",
        "112414    9.210440\n",
        "Name: monthly_income, Length: 112415, dtype: float64"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#numpy actually has log(x + 1)\n",
      "help(np.log1p)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Help on ufunc object:\n",
        "\n",
        "log1p = class ufunc(__builtin__.object)\n",
        " |  Functions that operate element by element on whole arrays.\n",
        " |  \n",
        " |  To see the documentation for a specific ufunc, use np.info().  For\n",
        " |  example, np.info(np.sin).  Because ufuncs are written in C\n",
        " |  (for speed) and linked into Python with NumPy's ufunc facility,\n",
        " |  Python's help() function finds this page whenever help() is called\n",
        " |  on a ufunc.\n",
        " |  \n",
        " |  A detailed explanation of ufuncs can be found in the \"ufuncs.rst\"\n",
        " |  file in the NumPy reference guide.\n",
        " |  \n",
        " |  Unary ufuncs:\n",
        " |  =============\n",
        " |  \n",
        " |  op(X, out=None)\n",
        " |  Apply op to X elementwise\n",
        " |  \n",
        " |  Parameters\n",
        " |  ----------\n",
        " |  X : array_like\n",
        " |      Input array.\n",
        " |  out : array_like\n",
        " |      An array to store the output. Must be the same shape as `X`.\n",
        " |  \n",
        " |  Returns\n",
        " |  -------\n",
        " |  r : array_like\n",
        " |      `r` will have the same shape as `X`; if out is provided, `r`\n",
        " |      will be equal to out.\n",
        " |  \n",
        " |  Binary ufuncs:\n",
        " |  ==============\n",
        " |  \n",
        " |  op(X, Y, out=None)\n",
        " |  Apply `op` to `X` and `Y` elementwise. May \"broadcast\" to make\n",
        " |  the shapes of `X` and `Y` congruent.\n",
        " |  \n",
        " |  The broadcasting rules are:\n",
        " |  \n",
        " |  * Dimensions of length 1 may be prepended to either array.\n",
        " |  * Arrays may be repeated along dimensions of length 1.\n",
        " |  \n",
        " |  Parameters\n",
        " |  ----------\n",
        " |  X : array_like\n",
        " |      First input array.\n",
        " |  Y : array_like\n",
        " |      Second input array.\n",
        " |  out : array_like\n",
        " |      An array to store the output. Must be the same shape as the\n",
        " |      output would have.\n",
        " |  \n",
        " |  Returns\n",
        " |  -------\n",
        " |  r : array_like\n",
        " |      The return value; if out is provided, `r` will be equal to out.\n",
        " |  \n",
        " |  Methods defined here:\n",
        " |  \n",
        " |  __call__(...)\n",
        " |      x.__call__(...) <==> x(...)\n",
        " |  \n",
        " |  __repr__(...)\n",
        " |      x.__repr__() <==> repr(x)\n",
        " |  \n",
        " |  __str__(...)\n",
        " |      x.__str__() <==> str(x)\n",
        " |  \n",
        " |  accumulate(...)\n",
        " |      accumulate(array, axis=0, dtype=None, out=None)\n",
        " |      \n",
        " |      Accumulate the result of applying the operator to all elements.\n",
        " |      \n",
        " |      For a one-dimensional array, accumulate produces results equivalent to::\n",
        " |      \n",
        " |        r = np.empty(len(A))\n",
        " |        t = op.identity        # op = the ufunc being applied to A's  elements\n",
        " |        for i in xrange(len(A)):\n",
        " |            t = op(t, A[i])\n",
        " |            r[i] = t\n",
        " |        return r\n",
        " |      \n",
        " |      For example, add.accumulate() is equivalent to np.cumsum().\n",
        " |      \n",
        " |      For a multi-dimensional array, accumulate is applied along only one\n",
        " |      axis (axis zero by default; see Examples below) so repeated use is\n",
        " |      necessary if one wants to accumulate over multiple axes.\n",
        " |      \n",
        " |      Parameters\n",
        " |      ----------\n",
        " |      array : array_like\n",
        " |          The array to act on.\n",
        " |      axis : int, optional\n",
        " |          The axis along which to apply the accumulation; default is zero.\n",
        " |      dtype : data-type code, optional\n",
        " |          The data-type used to represent the intermediate results. Defaults\n",
        " |          to the data-type of the output array if such is provided, or the\n",
        " |          the data-type of the input array if no output array is provided.\n",
        " |      out : ndarray, optional\n",
        " |          A location into which the result is stored. If not provided a\n",
        " |          freshly-allocated array is returned.\n",
        " |      \n",
        " |      Returns\n",
        " |      -------\n",
        " |      r : ndarray\n",
        " |          The accumulated values. If `out` was supplied, `r` is a reference to\n",
        " |          `out`.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      1-D array examples:\n",
        " |      \n",
        " |      >>> np.add.accumulate([2, 3, 5])\n",
        " |      array([ 2,  5, 10])\n",
        " |      >>> np.multiply.accumulate([2, 3, 5])\n",
        " |      array([ 2,  6, 30])\n",
        " |      \n",
        " |      2-D array examples:\n",
        " |      \n",
        " |      >>> I = np.eye(2)\n",
        " |      >>> I\n",
        " |      array([[ 1.,  0.],\n",
        " |             [ 0.,  1.]])\n",
        " |      \n",
        " |      Accumulate along axis 0 (rows), down columns:\n",
        " |      \n",
        " |      >>> np.add.accumulate(I, 0)\n",
        " |      array([[ 1.,  0.],\n",
        " |             [ 1.,  1.]])\n",
        " |      >>> np.add.accumulate(I) # no axis specified = axis zero\n",
        " |      array([[ 1.,  0.],\n",
        " |             [ 1.,  1.]])\n",
        " |      \n",
        " |      Accumulate along axis 1 (columns), through rows:\n",
        " |      \n",
        " |      >>> np.add.accumulate(I, 1)\n",
        " |      array([[ 1.,  1.],\n",
        " |             [ 0.,  1.]])\n",
        " |  \n",
        " |  outer(...)\n",
        " |      outer(A, B)\n",
        " |      \n",
        " |      Apply the ufunc `op` to all pairs (a, b) with a in `A` and b in `B`.\n",
        " |      \n",
        " |      Let ``M = A.ndim``, ``N = B.ndim``. Then the result, `C`, of\n",
        " |      ``op.outer(A, B)`` is an array of dimension M + N such that:\n",
        " |      \n",
        " |      .. math:: C[i_0, ..., i_{M-1}, j_0, ..., j_{N-1}] =\n",
        " |         op(A[i_0, ..., i_{M-1}], B[j_0, ..., j_{N-1}])\n",
        " |      \n",
        " |      For `A` and `B` one-dimensional, this is equivalent to::\n",
        " |      \n",
        " |        r = empty(len(A),len(B))\n",
        " |        for i in xrange(len(A)):\n",
        " |            for j in xrange(len(B)):\n",
        " |                r[i,j] = op(A[i], B[j]) # op = ufunc in question\n",
        " |      \n",
        " |      Parameters\n",
        " |      ----------\n",
        " |      A : array_like\n",
        " |          First array\n",
        " |      B : array_like\n",
        " |          Second array\n",
        " |      \n",
        " |      Returns\n",
        " |      -------\n",
        " |      r : ndarray\n",
        " |          Output array\n",
        " |      \n",
        " |      See Also\n",
        " |      --------\n",
        " |      numpy.outer\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.multiply.outer([1, 2, 3], [4, 5, 6])\n",
        " |      array([[ 4,  5,  6],\n",
        " |             [ 8, 10, 12],\n",
        " |             [12, 15, 18]])\n",
        " |      \n",
        " |      A multi-dimensional example:\n",
        " |      \n",
        " |      >>> A = np.array([[1, 2, 3], [4, 5, 6]])\n",
        " |      >>> A.shape\n",
        " |      (2, 3)\n",
        " |      >>> B = np.array([[1, 2, 3, 4]])\n",
        " |      >>> B.shape\n",
        " |      (1, 4)\n",
        " |      >>> C = np.multiply.outer(A, B)\n",
        " |      >>> C.shape; C\n",
        " |      (2, 3, 1, 4)\n",
        " |      array([[[[ 1,  2,  3,  4]],\n",
        " |              [[ 2,  4,  6,  8]],\n",
        " |              [[ 3,  6,  9, 12]]],\n",
        " |             [[[ 4,  8, 12, 16]],\n",
        " |              [[ 5, 10, 15, 20]],\n",
        " |              [[ 6, 12, 18, 24]]]])\n",
        " |  \n",
        " |  reduce(...)\n",
        " |      reduce(a, axis=0, dtype=None, out=None, keepdims=False)\n",
        " |      \n",
        " |      Reduces `a`'s dimension by one, by applying ufunc along one axis.\n",
        " |      \n",
        " |      Let :math:`a.shape = (N_0, ..., N_i, ..., N_{M-1})`.  Then\n",
        " |      :math:`ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}]` =\n",
        " |      the result of iterating `j` over :math:`range(N_i)`, cumulatively applying\n",
        " |      ufunc to each :math:`a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]`.\n",
        " |      For a one-dimensional array, reduce produces results equivalent to:\n",
        " |      ::\n",
        " |      \n",
        " |       r = op.identity # op = ufunc\n",
        " |       for i in xrange(len(A)):\n",
        " |         r = op(r, A[i])\n",
        " |       return r\n",
        " |      \n",
        " |      For example, add.reduce() is equivalent to sum().\n",
        " |      \n",
        " |      Parameters\n",
        " |      ----------\n",
        " |      a : array_like\n",
        " |          The array to act on.\n",
        " |      axis : None or int or tuple of ints, optional\n",
        " |          Axis or axes along which a reduction is performed.\n",
        " |          The default (`axis` = 0) is perform a reduction over the first\n",
        " |          dimension of the input array. `axis` may be negative, in\n",
        " |          which case it counts from the last to the first axis.\n",
        " |      \n",
        " |          .. versionadded:: 1.7.0\n",
        " |      \n",
        " |          If this is `None`, a reduction is performed over all the axes.\n",
        " |          If this is a tuple of ints, a reduction is performed on multiple\n",
        " |          axes, instead of a single axis or all the axes as before.\n",
        " |      \n",
        " |          For operations which are either not commutative or not associative,\n",
        " |          doing a reduction over multiple axes is not well-defined. The\n",
        " |          ufuncs do not currently raise an exception in this case, but will\n",
        " |          likely do so in the future.\n",
        " |      dtype : data-type code, optional\n",
        " |          The type used to represent the intermediate results. Defaults\n",
        " |          to the data-type of the output array if this is provided, or\n",
        " |          the data-type of the input array if no output array is provided.\n",
        " |      out : ndarray, optional\n",
        " |          A location into which the result is stored. If not provided, a\n",
        " |          freshly-allocated array is returned.\n",
        " |      keepdims : bool, optional\n",
        " |          If this is set to True, the axes which are reduced are left\n",
        " |          in the result as dimensions with size one. With this option,\n",
        " |          the result will broadcast correctly against the original `arr`.\n",
        " |      \n",
        " |      Returns\n",
        " |      -------\n",
        " |      r : ndarray\n",
        " |          The reduced array. If `out` was supplied, `r` is a reference to it.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.multiply.reduce([2,3,5])\n",
        " |      30\n",
        " |      \n",
        " |      A multi-dimensional array example:\n",
        " |      \n",
        " |      >>> X = np.arange(8).reshape((2,2,2))\n",
        " |      >>> X\n",
        " |      array([[[0, 1],\n",
        " |              [2, 3]],\n",
        " |             [[4, 5],\n",
        " |              [6, 7]]])\n",
        " |      >>> np.add.reduce(X, 0)\n",
        " |      array([[ 4,  6],\n",
        " |             [ 8, 10]])\n",
        " |      >>> np.add.reduce(X) # confirm: default axis value is 0\n",
        " |      array([[ 4,  6],\n",
        " |             [ 8, 10]])\n",
        " |      >>> np.add.reduce(X, 1)\n",
        " |      array([[ 2,  4],\n",
        " |             [10, 12]])\n",
        " |      >>> np.add.reduce(X, 2)\n",
        " |      array([[ 1,  5],\n",
        " |             [ 9, 13]])\n",
        " |  \n",
        " |  reduceat(...)\n",
        " |      reduceat(a, indices, axis=0, dtype=None, out=None)\n",
        " |      \n",
        " |      Performs a (local) reduce with specified slices over a single axis.\n",
        " |      \n",
        " |      For i in ``range(len(indices))``, `reduceat` computes\n",
        " |      ``ufunc.reduce(a[indices[i]:indices[i+1]])``, which becomes the i-th\n",
        " |      generalized \"row\" parallel to `axis` in the final result (i.e., in a\n",
        " |      2-D array, for example, if `axis = 0`, it becomes the i-th row, but if\n",
        " |      `axis = 1`, it becomes the i-th column).  There are two exceptions to this:\n",
        " |      \n",
        " |        * when ``i = len(indices) - 1`` (so for the last index),\n",
        " |          ``indices[i+1] = a.shape[axis]``.\n",
        " |        * if ``indices[i] >= indices[i + 1]``, the i-th generalized \"row\" is\n",
        " |          simply ``a[indices[i]]``.\n",
        " |      \n",
        " |      The shape of the output depends on the size of `indices`, and may be\n",
        " |      larger than `a` (this happens if ``len(indices) > a.shape[axis]``).\n",
        " |      \n",
        " |      Parameters\n",
        " |      ----------\n",
        " |      a : array_like\n",
        " |          The array to act on.\n",
        " |      indices : array_like\n",
        " |          Paired indices, comma separated (not colon), specifying slices to\n",
        " |          reduce.\n",
        " |      axis : int, optional\n",
        " |          The axis along which to apply the reduceat.\n",
        " |      dtype : data-type code, optional\n",
        " |          The type used to represent the intermediate results. Defaults\n",
        " |          to the data type of the output array if this is provided, or\n",
        " |          the data type of the input array if no output array is provided.\n",
        " |      out : ndarray, optional\n",
        " |          A location into which the result is stored. If not provided a\n",
        " |          freshly-allocated array is returned.\n",
        " |      \n",
        " |      Returns\n",
        " |      -------\n",
        " |      r : ndarray\n",
        " |          The reduced values. If `out` was supplied, `r` is a reference to\n",
        " |          `out`.\n",
        " |      \n",
        " |      Notes\n",
        " |      -----\n",
        " |      A descriptive example:\n",
        " |      \n",
        " |      If `a` is 1-D, the function `ufunc.accumulate(a)` is the same as\n",
        " |      ``ufunc.reduceat(a, indices)[::2]`` where `indices` is\n",
        " |      ``range(len(array) - 1)`` with a zero placed\n",
        " |      in every other element:\n",
        " |      ``indices = zeros(2 * len(a) - 1)``, ``indices[1::2] = range(1, len(a))``.\n",
        " |      \n",
        " |      Don't be fooled by this attribute's name: `reduceat(a)` is not\n",
        " |      necessarily smaller than `a`.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      To take the running sum of four successive values:\n",
        " |      \n",
        " |      >>> np.add.reduceat(np.arange(8),[0,4, 1,5, 2,6, 3,7])[::2]\n",
        " |      array([ 6, 10, 14, 18])\n",
        " |      \n",
        " |      A 2-D example:\n",
        " |      \n",
        " |      >>> x = np.linspace(0, 15, 16).reshape(4,4)\n",
        " |      >>> x\n",
        " |      array([[  0.,   1.,   2.,   3.],\n",
        " |             [  4.,   5.,   6.,   7.],\n",
        " |             [  8.,   9.,  10.,  11.],\n",
        " |             [ 12.,  13.,  14.,  15.]])\n",
        " |      \n",
        " |      ::\n",
        " |      \n",
        " |       # reduce such that the result has the following five rows:\n",
        " |       # [row1 + row2 + row3]\n",
        " |       # [row4]\n",
        " |       # [row2]\n",
        " |       # [row3]\n",
        " |       # [row1 + row2 + row3 + row4]\n",
        " |      \n",
        " |      >>> np.add.reduceat(x, [0, 3, 1, 2, 0])\n",
        " |      array([[ 12.,  15.,  18.,  21.],\n",
        " |             [ 12.,  13.,  14.,  15.],\n",
        " |             [  4.,   5.,   6.,   7.],\n",
        " |             [  8.,   9.,  10.,  11.],\n",
        " |             [ 24.,  28.,  32.,  36.]])\n",
        " |      \n",
        " |      ::\n",
        " |      \n",
        " |       # reduce such that result has the following two columns:\n",
        " |       # [col1 * col2 * col3, col4]\n",
        " |      \n",
        " |      >>> np.multiply.reduceat(x, [0, 3], 1)\n",
        " |      array([[    0.,     3.],\n",
        " |             [  120.,     7.],\n",
        " |             [  720.,    11.],\n",
        " |             [ 2184.,    15.]])\n",
        " |  \n",
        " |  ----------------------------------------------------------------------\n",
        " |  Data descriptors defined here:\n",
        " |  \n",
        " |  identity\n",
        " |      The identity value.\n",
        " |      \n",
        " |      Data attribute containing the identity element for the ufunc, if it has one.\n",
        " |      If it does not, the attribute value is None.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.identity\n",
        " |      0\n",
        " |      >>> np.multiply.identity\n",
        " |      1\n",
        " |      >>> np.power.identity\n",
        " |      1\n",
        " |      >>> print np.exp.identity\n",
        " |      None\n",
        " |  \n",
        " |  nargs\n",
        " |      The number of arguments.\n",
        " |      \n",
        " |      Data attribute containing the number of arguments the ufunc takes, including\n",
        " |      optional ones.\n",
        " |      \n",
        " |      Notes\n",
        " |      -----\n",
        " |      Typically this value will be one more than what you might expect because all\n",
        " |      ufuncs take  the optional \"out\" argument.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.nargs\n",
        " |      3\n",
        " |      >>> np.multiply.nargs\n",
        " |      3\n",
        " |      >>> np.power.nargs\n",
        " |      3\n",
        " |      >>> np.exp.nargs\n",
        " |      2\n",
        " |  \n",
        " |  nin\n",
        " |      The number of inputs.\n",
        " |      \n",
        " |      Data attribute containing the number of arguments the ufunc treats as input.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.nin\n",
        " |      2\n",
        " |      >>> np.multiply.nin\n",
        " |      2\n",
        " |      >>> np.power.nin\n",
        " |      2\n",
        " |      >>> np.exp.nin\n",
        " |      1\n",
        " |  \n",
        " |  nout\n",
        " |      The number of outputs.\n",
        " |      \n",
        " |      Data attribute containing the number of arguments the ufunc treats as output.\n",
        " |      \n",
        " |      Notes\n",
        " |      -----\n",
        " |      Since all ufuncs can take output arguments, this will always be (at least) 1.\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.nout\n",
        " |      1\n",
        " |      >>> np.multiply.nout\n",
        " |      1\n",
        " |      >>> np.power.nout\n",
        " |      1\n",
        " |      >>> np.exp.nout\n",
        " |      1\n",
        " |  \n",
        " |  ntypes\n",
        " |      The number of types.\n",
        " |      \n",
        " |      The number of numerical NumPy types - of which there are 18 total - on which\n",
        " |      the ufunc can operate.\n",
        " |      \n",
        " |      See Also\n",
        " |      --------\n",
        " |      numpy.ufunc.types\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.ntypes\n",
        " |      18\n",
        " |      >>> np.multiply.ntypes\n",
        " |      18\n",
        " |      >>> np.power.ntypes\n",
        " |      17\n",
        " |      >>> np.exp.ntypes\n",
        " |      7\n",
        " |      >>> np.remainder.ntypes\n",
        " |      14\n",
        " |  \n",
        " |  signature\n",
        " |  \n",
        " |  types\n",
        " |      Returns a list with types grouped input->output.\n",
        " |      \n",
        " |      Data attribute listing the data-type \"Domain-Range\" groupings the ufunc can\n",
        " |      deliver. The data-types are given using the character codes.\n",
        " |      \n",
        " |      See Also\n",
        " |      --------\n",
        " |      numpy.ufunc.ntypes\n",
        " |      \n",
        " |      Examples\n",
        " |      --------\n",
        " |      >>> np.add.types\n",
        " |      ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l',\n",
        " |      'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D',\n",
        " |      'GG->G', 'OO->O']\n",
        " |      \n",
        " |      >>> np.multiply.types\n",
        " |      ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l',\n",
        " |      'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D',\n",
        " |      'GG->G', 'OO->O']\n",
        " |      \n",
        " |      >>> np.power.types\n",
        " |      ['bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L',\n",
        " |      'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 'GG->G',\n",
        " |      'OO->O']\n",
        " |      \n",
        " |      >>> np.exp.types\n",
        " |      ['f->f', 'd->d', 'g->g', 'F->F', 'D->D', 'G->G', 'O->O']\n",
        " |      \n",
        " |      >>> np.remainder.types\n",
        " |      ['bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L',\n",
        " |      'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'OO->O']\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Using custom functions\n",
      "If you can't do it in a one-liner lambda function don't worry. `pandas` also let's `apply` your own custom functions. You can use custom functions when applying on Series and also when operating on chunks of data frames in `groupby`s."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def inverse(x):\n",
      "    return 1 / (x + 1)\n",
      "\n",
      "df.monthly_income.apply(inverse)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "0     0.000110\n",
        "1     0.000384\n",
        "2     0.000329\n",
        "3     0.000016\n",
        "4     0.000286\n",
        "5     0.000286\n",
        "6     0.000042\n",
        "7     0.000400\n",
        "8     0.000154\n",
        "9     0.000080\n",
        "10    1.000000\n",
        "11    0.000088\n",
        "12    0.000114\n",
        "13    0.000305\n",
        "14    0.002994\n",
        "...\n",
        "112400    0.000058\n",
        "112401    0.000241\n",
        "112402    0.000769\n",
        "112403    0.000103\n",
        "112404    0.000146\n",
        "112405    0.000351\n",
        "112406    0.000309\n",
        "112407    0.000371\n",
        "112408    0.000309\n",
        "112409    0.000166\n",
        "112410    0.000073\n",
        "112411    0.000166\n",
        "112412    0.000464\n",
        "112413    0.000265\n",
        "112414    0.000100\n",
        "Name: monthly_income, Length: 112415, dtype: float64"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "####Write a custom function called `cap_value(x, cap)` that will set x to the cap if x > cap. Then apply it to debt_ratio with a cap of 5."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def cap_value(x, cap):\n",
      "    \"\"\"\n",
      "    x - a value\n",
      "    cap - threshold value for x; if x > cap, then x is set to cap\n",
      "    Examples:\n",
      "        cap_value(1000, 10)\n",
      "        10\n",
      "        cap_value(10, 100)\n",
      "        10\n",
      "        \"\"\"\n",
      "    # your code here\n",
      "    return None\n",
      "print cap_value(1000, 10)==10\n",
      "print cap_value(10, 100)==10\n",
      "print df.debt_ratio.apply(lambda x: cap_value(x, 5.0)).mean()#should be close to 1.28"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "False\n",
        "False\n",
        "nan"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Split -> Apply ->Combine\n",
      "Split, Apply, Combine is a data munging methodology similar in spirit to `SQL`'s `GROUP BY`. The idea being you split your data into chunks, operate on those chunks, and then combine the results together into a single table. `groupby` in `pandas` works exactly the same way. But since we're using Python and not SQL, we have a lot more flexibility in terms of the types of operations we can perform in the apply step.\n",
      "\n",
      "From the `pandas` documentation:\n",
      "\n",
      " - Splitting the data into groups based on some criteria\n",
      " - Applying a function to each group independently\n",
      " - Combining the results into a data structure"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Split"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "subset = df[['serious_dlqin2yrs', 'age', 'monthly_income']]\n",
      "subset.groupby(\"serious_dlqin2yrs\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "<pandas.core.groupby.DataFrameGroupBy object at 0x103c5b550>"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Apply / Combine\n",
      "Aggregate whatever is returned"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "subset.groupby(\"serious_dlqin2yrs\").mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>age</th>\n",
        "      <th>monthly_income</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>serious_dlqin2yrs</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 52.734998</td>\n",
        "      <td> 6424.391821</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 45.966445</td>\n",
        "      <td> 5457.963915</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "                         age  monthly_income\n",
        "serious_dlqin2yrs                           \n",
        "0                  52.734998     6424.391821\n",
        "1                  45.966445     5457.963915"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What's really going on here? You can see below that when you `groupby` a certain variable(s), you're literally splitting the data into chunks based on each possible value of that variable."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for name, group in subset.groupby(\"serious_dlqin2yrs\"):\n",
      "    print \"splitting by: \", name\n",
      "    print group.mean()\n",
      "    print \"*\"*80"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "splitting by:  0\n",
        "serious_dlqin2yrs       0.000000\n",
        "age                    52.734998\n",
        "monthly_income       6424.391821\n",
        "dtype: float64\n",
        "********************************************************************************\n",
        "splitting by:  1\n",
        "serious_dlqin2yrs       1.000000\n",
        "age                    45.966445\n",
        "monthly_income       5457.963915\n",
        "dtype: float64\n",
        "********************************************************************************\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "####Use groupby to calculate the percent of customers that went bad for each age"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###You can also aggregate by multiple functions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "subset.groupby(\"serious_dlqin2yrs\").agg([np.min, np.mean, np.median, np.max])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr>\n",
        "      <th></th>\n",
        "      <th colspan=\"4\" halign=\"left\">age</th>\n",
        "      <th colspan=\"4\" halign=\"left\">monthly_income</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th></th>\n",
        "      <th>amin</th>\n",
        "      <th>mean</th>\n",
        "      <th>median</th>\n",
        "      <th>amax</th>\n",
        "      <th>amin</th>\n",
        "      <th>mean</th>\n",
        "      <th>median</th>\n",
        "      <th>amax</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>serious_dlqin2yrs</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>  0</td>\n",
        "      <td> 52.734998</td>\n",
        "      <td> 52</td>\n",
        "      <td> 109</td>\n",
        "      <td> 0</td>\n",
        "      <td> 6424.391821</td>\n",
        "      <td> 5250</td>\n",
        "      <td> 3008750</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 21</td>\n",
        "      <td> 45.966445</td>\n",
        "      <td> 46</td>\n",
        "      <td>  99</td>\n",
        "      <td> 0</td>\n",
        "      <td> 5457.963915</td>\n",
        "      <td> 4333</td>\n",
        "      <td>  250000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "                    age                           monthly_income               \\\n",
        "                   amin       mean  median  amax            amin         mean   \n",
        "serious_dlqin2yrs                                                               \n",
        "0                     0  52.734998      52   109               0  6424.391821   \n",
        "1                    21  45.966445      46    99               0  5457.963915   \n",
        "\n",
        "                                    \n",
        "                   median     amax  \n",
        "serious_dlqin2yrs                   \n",
        "0                    5250  3008750  \n",
        "1                    4333   250000  "
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###`pandas` also let's you use custom apply functions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def age_x_income(frame):\n",
      "    x = (frame.age * frame.monthly_income)\n",
      "    return np.mean(x)\n",
      "\n",
      "subset.groupby(\"serious_dlqin2yrs\").apply(age_x_income)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "serious_dlqin2yrs\n",
        "0                    343570.264001\n",
        "1                    260374.521838\n",
        "dtype: float64"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Merging and Joining"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pop = pd.read_csv(\"./data/uspop.csv\")\n",
      "pop"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<pre>\n",
        "&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
        "Int64Index: 78 entries, 0 to 77\n",
        "Data columns (total 2 columns):\n",
        "age        78  non-null values\n",
        "est_pop    78  non-null values\n",
        "dtypes: float64(1), int64(1)\n",
        "</pre>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 78 entries, 0 to 77\n",
        "Data columns (total 2 columns):\n",
        "age        78  non-null values\n",
        "est_pop    78  non-null values\n",
        "dtypes: float64(1), int64(1)"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cols = ['age', 'monthly_income', 'serious_dlqin2yrs']\n",
      "result = pd.merge(df[cols] , pop, how='left', on='age')\n",
      "result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<pre>\n",
        "&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
        "Int64Index: 116157 entries, 0 to 116156\n",
        "Data columns (total 4 columns):\n",
        "age                  116157  non-null values\n",
        "monthly_income       116157  non-null values\n",
        "serious_dlqin2yrs    116157  non-null values\n",
        "est_pop              114660  non-null values\n",
        "dtypes: float64(2), int64(2)\n",
        "</pre>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 116157 entries, 0 to 116156\n",
        "Data columns (total 4 columns):\n",
        "age                  116157  non-null values\n",
        "monthly_income       116157  non-null values\n",
        "serious_dlqin2yrs    116157  non-null values\n",
        "est_pop              114660  non-null values\n",
        "dtypes: float64(2), int64(2)"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(result) > len(df)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "True"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.value_counts(pop.age).head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "29    2\n",
        "59    2\n",
        "36    1\n",
        "30    1\n",
        "31    1\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pop = pop[pop.age.duplicated()==False]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cols = ['age', 'monthly_income', 'serious_dlqin2yrs']\n",
      "joined = pd.merge(df[cols] , pop, how='left', on='age')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pop.tail()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>age</th>\n",
        "      <th>est_pop</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>73</th>\n",
        "      <td> 81</td>\n",
        "      <td> 5672.346939</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>74</th>\n",
        "      <td> 82</td>\n",
        "      <td> 5656.795918</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>75</th>\n",
        "      <td> 83</td>\n",
        "      <td> 5641.244898</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>76</th>\n",
        "      <td> 84</td>\n",
        "      <td> 5625.693878</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>77</th>\n",
        "      <td> 85</td>\n",
        "      <td> 4957.000000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "    age      est_pop\n",
        "73   81  5672.346939\n",
        "74   82  5656.795918\n",
        "75   83  5641.244898\n",
        "76   84  5625.693878\n",
        "77   85  4957.000000"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "joined.est_pop = joined.est_pop.fillna(4957.0)\n",
      "joined.est_pop.describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "count    112415.000000\n",
        "mean      17879.408997\n",
        "std        4874.574536\n",
        "min        4957.000000\n",
        "25%       16892.244898\n",
        "50%       19467.306122\n",
        "75%       21510.408163\n",
        "max       21988.020408\n",
        "dtype: float64"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##`pandasql`\n",
      "Training wheels for `pandas`. We developed and open sourced `pandasql` to help people coming from other languages ease into the `pandas` syntax. It allows you to query `pandas` data frames like they were `SQL` tables."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pandasql import sqldf\n",
      "pysqldf = lambda q: sqldf(q, globals())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "query = \"\"\"\n",
      "select\n",
      "    serious_dlqin2yrs\n",
      "    , sum(1) as total\n",
      "from\n",
      "    df\n",
      "group by\n",
      "    serious_dlqin2yrs;\n",
      "\"\"\"\n",
      "pysqldf(query)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>serious_dlqin2yrs</th>\n",
        "      <th>total</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 0</td>\n",
        "      <td> 104905</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 1</td>\n",
        "      <td>   7510</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "   serious_dlqin2yrs   total\n",
        "0                  0  104905\n",
        "1                  1    7510"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "query = \"\"\"\n",
      "select\n",
      "    age\n",
      "    , avg(serious_dlqin2yrs) as pct_delinquent\n",
      "from\n",
      "    df\n",
      "group by\n",
      "    age\n",
      "order by\n",
      "    age;\n",
      "\"\"\"\n",
      "pysqldf(query)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<pre>\n",
        "&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
        "Int64Index: 86 entries, 0 to 85\n",
        "Data columns (total 2 columns):\n",
        "age               86  non-null values\n",
        "pct_delinquent    86  non-null values\n",
        "dtypes: float64(2)\n",
        "</pre>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 24,
       "text": [
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 86 entries, 0 to 85\n",
        "Data columns (total 2 columns):\n",
        "age               86  non-null values\n",
        "pct_delinquent    86  non-null values\n",
        "dtypes: float64(2)"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##We just did the following\n",
      "\n",
      "- Used `apply` to make custom data transformations\n",
      "- Did `groupby` and aggregate operations using `pandas`\n",
      "- Used both `pandas` and `pandasql` to merge data frames together"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}